
OpenVINO integration for CausalLM models #17

Open
helena-intel wants to merge 2 commits into main from openvino-support-causallm
Conversation

helena-intel

OpenVINO integration for text-generation-inference.

Known limitations:

  • Seq2Seq models are not yet supported in this integration; support will be added later.
  • Only the CPU device is supported at the moment. I want to test and add GPU support in a later PR. Should I add an environment variable OPENVINO_DEVICE, or is there a better way? (See the sketch below.)

It would be great to have a documented option to build the Docker image without GPU dependencies and flash-attention, for example with a make cpubuild target. make build-test-image works fine with this integration.
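A minimal sketch of how the device selection suggested above could look, assuming optimum-intel's OVModelForCausalLM and its .to() method; the OPENVINO_DEVICE variable and its default are the proposal from this description, not code from the PR:

import os
from optimum.intel import OVModelForCausalLM

# Hypothetical: pick the OpenVINO device from the environment, defaulting to CPU.
ov_device = os.environ.get("OPENVINO_DEVICE", "CPU")

# Placeholder model id; export=True converts non-OpenVINO weights on load.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
model.to(ov_device)  # e.g. "CPU" or "GPU"; optimum-intel compiles for that device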

@helena-intel helena-intel force-pushed the openvino-support-causallm branch from 0692b1c to 76a44fa on December 12, 2023 at 15:47
model_path: str,
model_class: Union[AutoModelForCausalLM, AutoModelForSeq2SeqLM],
dtype: torch.dtype,
quantize: Optional[str], # not used by OpenVINO


Why not consider the quantize parameter as a trigger to compress the model weights to INT8 or INT4?

helena-intel (Author)

quantize is currently used for bitsandbytes and GPTQ, and passing anything else throws an error. We could presumably modify that, but for weight compression it seemed that load_in_8bit (which is now the default) and, soon, load_in_4bit would be a better fit.

Personally, I would always compress offline and load the compressed model directly. TGIS requires downloaded weights, so to compress the model on the fly you would have to download the full-precision weights, keep them on disk, and then have TGIS compress the model to 4 or 8 bit on every startup, which takes several minutes.
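As an illustration of the offline route, a hedged sketch using optimum-intel's load_in_8bit flag mentioned above; the model id and output path are placeholders:

from optimum.intel import OVModelForCausalLM

# Compress once, offline: convert to OpenVINO IR with 8-bit weights and save.
model = OVModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",  # placeholder model id
    export=True,
    load_in_8bit=True,
)
model.save_pretrained("./bloom-560m-ov-int8")

# Later, TGIS can load the compressed model directly, with no per-start conversion.
model = OVModelForCausalLM.from_pretrained("./bloom-560m-ov-int8")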


Thanks @helena-intel. I agree that offline compression is better. But I noticed that you do allow on-the-fly conversion here, based on the logic below where you add the flag kwargs["export"] = True when the model is not already in OpenVINO format (model_is_ov). That is why I ask about on-the-fly compression for such models as well.
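For readers following along, a hedged reconstruction of the pattern this comment describes; model_is_ov and kwargs["export"] are taken from the comment, while the surrounding function is an assumption, not the PR's actual code:

from optimum.intel import OVModelForCausalLM

def load_ov_model(model_path: str, model_is_ov: bool):
    # If the weights on disk are already OpenVINO IR, load them directly;
    # otherwise ask optimum-intel to convert ("export") them on the fly.
    kwargs = {}
    if not model_is_ov:
        kwargs["export"] = True
    return OVModelForCausalLM.from_pretrained(model_path, **kwargs)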

helena-intel (Author)

It's a good point! Note that we already do on-the-fly compression for models with more than 1B parameters, because they will be converted to 8-bit by optimum-intel. But it would be good to allow configuring this, especially now that optimum-intel will soon have load_in_4bit. How do you propose to include this? Add a weight_compression option for quantize in addition to bitsandbytes and gptq? Or weight_compression_int4 and weight_compression_int8? Currently, setting dtype_str to int8 also enables bitsandbytes quantization, so I thought we could reuse that, but it doesn't allow int4 out of the box because dtype_str is limited to torch dtypes. That can all be changed, but I would first like a maintainer's opinion on the best way to do this.

Another option could be to add an environment variable OPENVINO_WEIGHT_FORMAT and allow specifying an exact config for sym/asym, group size, and ratio. That would be the most flexible, but it is a different API than other inference engines use.
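A hedged sketch of what that could look like, assuming optimum-intel's OVWeightQuantizationConfig (with its bits, sym, group_size, and ratio fields); the OPENVINO_WEIGHT_FORMAT name and its syntax are illustrative only:

import os
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Hypothetical env var: "int4" or "int8"; could be extended with
# sym/asym, group size, and ratio as described above.
fmt = os.environ.get("OPENVINO_WEIGHT_FORMAT", "int8")

if fmt == "int4":
    # ratio=0.8 compresses 80% of layers to 4-bit, the rest to 8-bit.
    quantization_config = OVWeightQuantizationConfig(
        bits=4, sym=True, group_size=128, ratio=0.8
    )
else:
    quantization_config = OVWeightQuantizationConfig(bits=8)

model = OVModelForCausalLM.from_pretrained(
    "gpt2",  # placeholder model id
    export=True,
    quantization_config=quantization_config,
)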

@helena-intel helena-intel force-pushed the openvino-support-causallm branch from 76a44fa to 6349d91 on February 2, 2024 at 13:40
@helena-intel helena-intel force-pushed the openvino-support-causallm branch from 6349d91 to 3fc754e on February 2, 2024 at 16:53